ABSTRACT


This thesis presents a solution to speaker-independent isolated word
recognition using vector quantization and hidden Markov model (HMM) based
analysis together with front end processing. Both the vector quantizer and the
hidden Markov models need to be trained for the vocabulary to be recognized. In
this case such training results in a distinct hidden Markov model for each word
in the vocabulary. Recognition consists of computing the probability for each
word model and selecting the word with the highest probability.

In this thesis linear predictive coding (LPC) analysis is done in the front
end processor to convert each frame of the speech signal into a parametric
representation (cepstral coefficients). This results in a series of vectors
characteristic of the time-varying spectral parameters of the speech signal.
These vectors are grouped into discrete sets by the k-means clustering
algorithm. During the training process, a separate HMM is built for each word
in the vocabulary. During the recognition process the Viterbi algorithm is used
to determine the HMM within the set of HMMs that best matches the observation
sequence.
Chapter 1


Introduction 


Real world processes generally produce observable outputs which can be
characterized as signals. The signals may be discrete in nature (e.g.
characters from a finite alphabet, quantized vectors from a codebook, etc.) or
continuous (e.g. speech samples, temperature measurements, etc.). The signal
source may be stationary (i.e. its statistical properties do not vary with
time) or non-stationary. The signal may be pure (i.e. coming strictly from a
single source) or corrupted by other signal sources (e.g. noise) or by
transmission distortions. Speech signal production may be viewed as a form of
filtering in which a sound source excites a vocal tract filter.

The speech signal is a slowly time-varying signal in the sense that, when
examined over a sufficiently short period of time (between 5 and 100 msec), its
characteristics are fairly stationary. However, over longer periods of time (on
the order of 1/5 second or more) the signal characteristics change. The speech
signal is therefore considered non-stationary, in the sense that its
characteristics vary with time [1]. Certain characteristics of the speech
signal are worth mentioning:

• Voiced sounds during vowels are characterized by quasi-periodicity, low
frequency content and large amplitude.

• Unvoiced sounds and fricatives are characterized by randomness, high
frequency content, and relatively low amplitude.

• The transition between voiced and unvoiced sound is gradual. 

To classify the types of signal model, we first have to characterize
time-varying signals into distinct classes. There are broadly two kinds of
signal classes: deterministic signals and random signals. In a deterministic
signal there is no uncertainty with respect to its value at any time, i.e. such
signals may be modeled as completely specified functions of time. In a random
signal, on the other hand, there is some degree of uncertainty before it
actually occurs. Based on this classification, a signal model may be
characterized as a deterministic model or a statistical model. In the case of a
deterministic model, the specification of the signal model requires
characteristic properties such as the amplitude, frequency and phase of a sine
wave, or the amplitudes and rates of exponentials. In the case of a statistical
model, one tries to model the statistical nature of the signal. In this case,
the signal is assumed to be non-stationary and to possess some well-defined
parameters which can be estimated in a well-defined manner.

Signal processing is the first step in solving the speech recognition
problem. In this step relevant information is extracted from the speech signal
in an efficient and robust manner. Spectral analysis is used to characterize
the time-varying properties of the speech signal. The properties of the speech
signal are stationary over a very short duration and change over a longer
interval. The short-time properties of the signal can be conveniently
represented by spectral measurement vectors. There are many ways in which the
spectral analysis can be performed. These include standard methods such as the
FFT-based discrete Fourier transform, LPC, and autoregressive/moving average
(ARMA) modeling [9]. In speech modeling we often call this short-time spectral
vector an observation. The short-time spectra of a signal can be described in
a mathematically consistent framework, thereby offering an analytic solution to
the speech problem.

There are various approaches by which isolated word recognition can be
done. These approaches can be broadly classified as follows: the acoustic
phonetic approach, the pattern recognition approach and the neural network
based approach [4]. The acoustic phonetic approach is based upon the assumption
that there exists a finite set of distinctive phonetic units in the spoken
language. It requires extensive knowledge of the acoustic properties of the
phonetic units. The implementation of an isolated word recognition system based
on this approach is considered in detail by Rabiner in [4].

In the pattern recognition approach we are not concerned with the acoustic
properties of the phonetic units. In this approach, the model parameters are
computed during the training phase. How the actual training is done is
explained in chapter 3. During the recognition phase the test word is
classified by pattern comparison. In a conventional pattern recognition system
the unknown test token (i.e. pattern) is time aligned in turn to each
reference pattern via some form of time warping procedure, typically dynamic
time warping (DTW) [11, 6]. Such a conventional recognition system works as
follows. The input speech signal is passed through a band pass filter and
digitized. The digitized signal is then processed by the pre-processing block,
which provides high frequency pre-emphasis to the speech. The pre-emphasized
signal is blocked into frames and LPC analysis is performed on each frame of
the word, thus creating the test pattern. This test pattern is compared with
each reference pattern (obtained via a training algorithm) using the DTW
alignment algorithm (see Appendix A), which simultaneously provides a distance
score associated with the alignment. The distance scores for all the reference
patterns are sent to a decision rule, which provides the classification of the
spoken word. The pattern recognition approach to speech recognition has the
following features:

1. The method is easy to understand and rich in mathematics. 

2. The performance of the system is sensitive to the amount of training data 
available. 

3. No speech-specific knowledge is used explicitly in the system. 

4. It provides high performance. 

Itakura [17] introduced DTW for the non-linear alignment of speech. His
system was tested on a single speaker and achieved 97.3% accuracy on a 200-word
isolated word task without the use of a grammar.

Another statistical approach to word recognition uses the HMM. No direct
time alignment is performed in an HMM system; only an indirect time alignment
is obtained, based on probability scoring. Rabiner, Levinson and Sondhi
compared the performance of LPC/DTW and HMM/VQ [6]. A test set consisting of
one replication of each of the ten digits by a set of 100 talkers was used.
They found that the average word accuracy of the LPC/DTW isolated word
recognizer was 98.5% and that of the HMM/VQ recognizer was 96.3%.

Nowadays an approach based on neural networks is also popular. A neural
network, which is basically a parallel distributed processing model, is a dense
interconnection of simple, non-linear, computational elements [4]. Four model
characteristics must be specified to implement an arbitrary neural network.

1. Number and type of inputs - The issues involved in the choice of inputs to
a neural network are similar to those involved in the choice of features for
any pattern classification system.

2. Connectivity of the network - This issue involves the size of the network,
i.e. the number of hidden layers and the number of nodes in each layer
between input and output, and the type of interconnection.

3. Choice of the offset - The choice of the threshold for each computational
element must be made as part of the training procedure, which chooses
values for the interconnection weights and the offset.

4. Choice of nonlinearity - The exact choice of nonlinearity is also very
important in terms of network performance. However, the nonlinearity must be
continuous and differentiable for the training algorithm to be applicable.

Neural networks possess many features which make them attractive for speech
recognition. Some of these features are given below:

1. They can readily implement a massive degree of parallel computation. 

2. They intrinsically possess a great deal of robustness or fault tolerance. 

3. The connection weights of the network need not be constrained to be
fixed: they can be adapted in real time to improve performance.

4. Because of the non-linearity within each computational element, a suffi- 
ciently large neural network can approximate any non-linearity or non- 
linear dynamical system. 

To be useful for speech recognition, a layered feed forward neural network must
have a number of properties. First, it should have multiple layers and
sufficient interconnections between units in each of these layers. This is to
ensure that the network will have the ability to learn complex nonlinear
decision surfaces. Second, the network should have the ability to represent
relationships between events in time. Third, the actual features or
abstractions learned by the network should be invariant under translation in
time. Fourth, the learning procedure should not require precise temporal
alignment of the labels that are to be learned.

The Time Delay Neural Network (TDNN) architecture satisfies all these
criteria. It is the simplest neural network structure that incorporates speech
pattern dynamics. The TDNN is discussed further by Waibel, Hanazawa, Hinton and
Lang in [5]. They compared the performance of the TDNN with that of the HMM on
1946 testing tokens obtained from three speakers and found that the TDNN
achieved a recognition rate of 98.5% while the rate achieved by the HMM was
93.7%.

We deal only with the implementation of isolated word recognition using
HMMs. The thesis is organized as follows. In chapter 2 the basics of the
hidden Markov model are described, including its three problems. In chapter 3
the strategy of signal processing is explained, and it is shown how the Viterbi
algorithm can be applied for scoring. In the last chapter the results and the
model parameters are discussed. In appendix A the features of the DTW algorithm
are explained and in appendix B the theory of LPC is given.



Chapter 2 


Basics of hidden Markov model 


The hidden Markov model is a widely used statistical method for characterizing
the spectral properties of the frames of a signal pattern. These models are
also referred to as Markov sources or probabilistic functions of Markov chains
in the communication literature. The states in a hidden Markov model are
associated either with a set of discrete symbols, with an observation
probability assigned to each symbol, or with a set of continuous observations
with a continuous observation density function. Each transition in the state
diagram of a hidden Markov model also has a transition probability associated
with it.

The foundation of HMM methodology is built on the well known estab- 
lished field of statistics and probability theory. Basic theoretical strength of 
HMM is that it combines modeling of stationary stochastic processes (for the 
short time spectra) and the temporal relationship among the processes (via a 
Markov Chain) together in a well defined probability space. All the features 
and their underlying basic concepts are discussed in this chapter, starting with 
the concept of Markov Process. 





2.1 Discrete-Time Markov Processes 


The discrete-time Markov process is the basic concept underlying the hidden
Markov model. It is explained here for N (the number of states) equal to five.
Consider a system that may be described at any time as being in one of a set
of N distinct states indexed by $i$, $i = 1, 2, \ldots, N$, as shown in fig
2.1. At regularly spaced, discrete times, the system undergoes a change of
state according to a set of probabilities associated with the state. Let $q_t$
denote the state at time $t$. In a Markovian process the transition from one
state to another depends only on the immediately preceding state, i.e. the
transition from state $q_{t-1}$ to state $q_t$ is independent of the earlier
states $q_{t-2}$, $q_{t-3}$ and so on. Mathematically it can be written as

$$ P[q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots] = P[q_t = j \mid q_{t-1} = i] \tag{2.1} $$

Let the transition probability from $q_{t-1}$ to $q_t$ be represented by
$a_{ij}$, i.e. the state transition probability is the probability that the
system is in state $j$ at time instant $t$, given that the system is in state
$i$ at time $t-1$:

$$ a_{ij} = P[q_t = j \mid q_{t-1} = i], \quad 1 \le i, j \le N \tag{2.2} $$

The state transition probabilities have the following properties

$$ a_{ij} \ge 0 \quad \forall\, i, j \tag{2.3} $$

$$ \sum_{j=1}^{N} a_{ij} = 1 \quad \forall\, i \tag{2.4} $$

since they obey the standard stochastic constraints.
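The constraints (2.3) and (2.4) and the one-step dependence in (2.1) can be illustrated with a minimal sketch. The function names `is_stochastic` and `sample_chain` and the 3-state matrix below are hypothetical, chosen only for illustration, and states are numbered from 0 rather than from 1 as in the text.

```python
import random

def is_stochastic(A, tol=1e-9):
    """Check (2.3) and (2.4): every a_ij >= 0 and every row sums to 1."""
    return all(
        all(a >= 0.0 for a in row) and abs(sum(row) - 1.0) <= tol
        for row in A
    )

def sample_chain(A, start, length, rng):
    """Draw a state sequence; each new state depends only on the previous one."""
    states = [start]
    for _ in range(length - 1):
        row = A[states[-1]]  # only the current state selects the row
        states.append(rng.choices(range(len(A)), weights=row)[0])
    return states

# Hypothetical 3-state transition matrix (illustrative values only).
A = [[0.5, 0.3, 0.2],
     [0.1, 0.8, 0.1],
     [0.0, 0.4, 0.6]]
print(is_stochastic(A))
print(sample_chain(A, start=0, length=10, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampled sequence reproducible from run to run.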
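Figure 2.1: Markov chain

The five-state Markov chain above can be represented by the transition matrix
given below.

$$
A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\
a_{21} & a_{22} & a_{23} & a_{24} & a_{25} \\
a_{31} & a_{32} & a_{33} & a_{34} & a_{35} \\
a_{41} & a_{42} & a_{43} & a_{44} & a_{45} \\
a_{51} & a_{52} & a_{53} & a_{54} & a_{55}
\end{bmatrix} \tag{2.5}
$$
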

2.2 Type of HMM 

In an ergodic or fully connected HMM every state of the model can be reached
(in a single step) from every other state of the model. The transition matrix
for such a model with three states is shown below.

$$
A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{bmatrix} \tag{2.6}
$$

For some applications other types of HMM have been found to perform better
than the standard ergodic model. One such model is the left-right model. This
model has the property that as time increases, the state index increases or
stays the same, i.e. the state proceeds from left to right. The fundamental
property of all left-right HMMs is that the state transition coefficients
satisfy

$$ a_{ij} = 0, \quad j < i \tag{2.7} $$

i.e. no transitions are allowed to states whose indices are lower than that of
the current state. Furthermore, the initial state probabilities have the
property

$$ \pi_i = \begin{cases} 0, & i \neq 1 \\ 1, & i = 1 \end{cases} \tag{2.8} $$

The state transition matrix for such a left-right model with three states is
thus

$$
A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
0 & a_{22} & a_{23} \\
0 & 0 & a_{33}
\end{bmatrix} \tag{2.9}
$$

For the last state in a left-right model, the state transition coefficients are
specified as

$$ a_{NN} = 1 \tag{2.10} $$

$$ a_{Nj} = 0, \quad j < N \tag{2.11} $$

Here we have dichotomized HMMs into ergodic and left-right models, but there
are many possible variations and combinations. These variations are
obtained by placing constraints on the transition probability matrix.
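As an illustration of constraints (2.7), (2.10) and (2.11), the following sketch builds and checks a left-right transition matrix. The helper names and the `max_jump` parameter (which limits how far ahead a single transition may go) are assumptions made for this example and are not taken from the text; states are numbered from 0.

```python
def uniform_left_right(N, max_jump=2):
    """Hypothetical initializer: spread probability evenly over the allowed
    forward moves i -> i, i+1, ..., i+max_jump (clipped at the last state)."""
    A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        last = min(N - 1, i + max_jump)
        width = last - i + 1
        for j in range(i, last + 1):
            A[i][j] = 1.0 / width
    return A

def is_left_right(A, tol=1e-12):
    """Check a_ij = 0 for j < i (2.7) and a_NN = 1 for the last state (2.10)."""
    N = len(A)
    no_backward = all(abs(A[i][j]) <= tol for i in range(N) for j in range(i))
    absorbing_last = abs(A[N - 1][N - 1] - 1.0) <= tol
    return no_backward and absorbing_last

A = uniform_left_right(3)
for row in A:
    print(row)
print(is_left_right(A))
```

Because the last row can only keep its diagonal entry, (2.11) follows automatically from (2.7) together with the row-sum constraint.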
2.3 Elements of HMM


A hidden Markov model is characterized by its three model parameters, i.e. the
state transition probability, the observation symbol probability and the
initial state probability. We formally define the elements of an HMM as:

• N, the number of states in the model. In HMMs we shall not rigorously
define what a state is, but simply say that within a state the signal
possesses some measurable and distinctive properties. The individual
states are labeled $\{1, 2, 3, \ldots, N\}$ and the state at time $t$ is
denoted $q_t$.

• At each clock time $t$, a new state is entered based upon a transition
probability distribution which depends only on the previous state (the
Markovian property). The state transition probability distribution is
represented as $A = \{a_{ij}\}$ where

$$ a_{ij} = P[q_{t+1} = j \mid q_t = i], \quad 1 \le i, j \le N \tag{2.12} $$

• M, the number of distinct observation symbols per state, i.e. the
discrete alphabet size. The discrete symbols are represented as
$V = \{v_1, v_2, \ldots, v_M\}$. The observation sequence produced by a system
is represented by $O = \{o_1, o_2, \ldots, o_T\}$, where $o_t$ corresponds to
one of the symbols from the discrete symbol set $V$.

• After each transition is made, an observation output symbol $o_t$ is
produced according to the observation probability distribution
$B = \{b_j(k)\}$. This probability distribution is held fixed for the state
regardless of when and how the state is entered. The probability of emitting
the symbol $v_k$ at time $t$ in state $j$ is denoted as

$$ b_j(k) = P[o_t = v_k \mid q_t = j], \quad 1 \le k \le M \tag{2.13} $$

• The initial state distribution $\pi = \{\pi_i\}$ in which

$$ \pi_i = P[q_1 = i], \quad 1 \le i \le N \tag{2.14} $$

Hence the complete specification of an HMM requires the specification of two
parameters, N and M, the specification of the observation symbols, and the
specification of the three sets of probability measures A, B and $\pi$. The
complete set of parameters is represented as

$$ \lambda = (A, B, \pi) \tag{2.15} $$

for a model.
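To make the roles of A, B and $\pi$ concrete, the sketch below treats a discrete HMM as a generator of observation sequences. The class name `DiscreteHMM`, the function `generate` and the parameter values are hypothetical; states and symbols are numbered from 0.

```python
import random
from dataclasses import dataclass

@dataclass
class DiscreteHMM:
    A: list   # N x N state transition probabilities a_ij
    B: list   # N x M observation probabilities b_j(k)
    pi: list  # N initial state probabilities pi_i

def generate(model, T, rng):
    """Produce T (state, symbol) pairs: draw q_1 from pi, emit a symbol from
    the row of B for the current state, then move according to the row of A."""
    N, M = len(model.pi), len(model.B[0])
    q = rng.choices(range(N), weights=model.pi)[0]
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(rng.choices(range(M), weights=model.B[q])[0])
        q = rng.choices(range(N), weights=model.A[q])[0]
    return states, obs

# Illustrative two-state, two-symbol model (values are not from the thesis).
model = DiscreteHMM(A=[[0.7, 0.3], [0.4, 0.6]],
                    B=[[0.5, 0.5], [0.1, 0.9]],
                    pi=[0.6, 0.4])
states, obs = generate(model, T=8, rng=random.Random(0))
print(states)
print(obs)
```

Only `obs` would be visible to a recognizer; `states` is the hidden part of the model.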


2.4 Three basic problems for HMM

There are three basic problems of interest that must be solved for the model to
be useful in real-world applications. These problems are:

• Problem 1: Given the observation sequence $O = o_1 o_2 \ldots o_T$ and a
model $\lambda = (A, B, \pi)$, how do we efficiently compute
$P(O \mid \lambda)$, i.e. the probability of the observation sequence, given
the model?

• Problem 2: Given the observation sequence $O = o_1 o_2 \ldots o_T$ and the
model $\lambda$, how do we choose a corresponding state sequence
$q = q_1 q_2 \ldots q_T$ which is optimum in some meaningful sense (i.e. best
“explains” the observation)?

• Problem 3: How do we adjust the model parameters $\lambda = (A, B, \pi)$ to
maximize $P(O \mid \lambda)$?

Problem 1 is the evaluation problem. The problem is one of scoring, how well 
a given model matches a given observation sequence. Problem 2 is the one 
in which we attempt to uncover the hidden part of the model i.e., to find the 
correct state sequence. Problem 3 is one in which we attempt to optimize the 
model parameters to best describe how a given observation sequence comes 
about. 


2.5 Solutions to the three problems 

The solutions to the three basic problems are mathematically linked to each
other within a probabilistic framework.

2.5.1 Problem 1 - Probability evaluation

This is the problem of computing the probability of the observation sequence
$O = (o_1, o_2, \ldots, o_T)$, given the model $\lambda$, i.e.
$P(O \mid \lambda)$. Consider the state sequence

$$ q = (q_1, q_2, \ldots, q_T) \tag{2.16} $$

where $q_1$ is the initial state. The probability of the observation sequence
O, given the state sequence q, is computed as

$$ P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) \tag{2.17} $$

where the observations are assumed to be statistically independent. Thus
we get

$$ P(O \mid q, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T) \tag{2.18} $$

The probability of such a state sequence can be written as

$$ P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T} \tag{2.19} $$

The joint probability of O and q, i.e. the probability that O and q occur
simultaneously, is written as

$$ P(O, q \mid \lambda) = P(O \mid q, \lambda)\, P(q \mid \lambda) \tag{2.20} $$

The probability of the observation sequence O, given the model, is obtained by
summing this joint probability over all possible state sequences q, giving

$$ P(O \mid \lambda) = \sum_{\text{all } q} P(O \mid q, \lambda)\, P(q \mid \lambda) = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) \tag{2.21} $$

The calculation of $P(O \mid \lambda)$ according to the direct definition given
in (2.21) involves on the order of $2T \cdot N^T$ calculations. Therefore a
more efficient calculation procedure is required to solve problem 1. This
procedure is called the forward procedure.

The Forward Procedure

Consider the forward variable $\alpha_t(i)$ defined as

$$ \alpha_t(i) = P(o_1 o_2 \ldots o_t, q_t = i \mid \lambda) \tag{2.22} $$

i.e., the probability of the partial observation sequence
$o_1 o_2 \ldots o_t$ (until time $t$) and state $i$ at time $t$, given the
model $\lambda$. $\alpha_t(i)$ can be solved inductively as

1. Initialization

$$ \alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N \tag{2.23} $$

2. Induction

$$ \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1}), \quad 1 \le t \le T-1, \quad 1 \le j \le N \tag{2.24} $$

3. Termination

$$ P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{2.25} $$

Step 1 initializes the forward probabilities as the joint probability of state
$i$ and the initial observation $o_1$. Step 2 accounts for how state $j$ is
reached at time $t+1$ from the N possible states $i$, $i = 1, 2, \ldots, N$,
at time $t$. Step 3 gives the desired calculation of $P(O \mid \lambda)$ as the
sum of the terminal forward variables $\alpha_T(i)$. The number of
calculations required in this procedure is on the order of $N^2 T$.
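The three steps of the forward procedure can be sketched as follows. This is a minimal, unscaled version intended for short sequences only; the function name `forward` and the two-state example model are assumptions made for illustration, and indices start at 0.

```python
def forward(A, B, pi, obs):
    """Return (alpha, P(O|lambda)) for a discrete HMM.
    alpha[t][i] corresponds to alpha_{t+1}(i) in the 1-based notation."""
    N = len(pi)
    # Initialization (2.23)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # Induction (2.24)
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
            for j in range(N)
        ])
    # Termination (2.25)
    return alpha, sum(alpha[-1])

# Illustrative model (values are not from the thesis).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
alpha, prob = forward(A, B, pi, [0, 1, 1])
print(prob)
```

For long observation sequences the products underflow, so practical implementations usually rescale $\alpha_t(i)$ at each step or work with logarithms.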

The Backward Procedure

In a similar manner we can consider a backward variable $\beta_t(i)$ defined
as

$$ \beta_t(i) = P(o_{t+1} o_{t+2} \ldots o_T \mid q_t = i, \lambda) \tag{2.26} $$

i.e., the probability of the partial observation sequence from $t+1$ to the
end, given the state $i$ at time $t$ and the model $\lambda$. Again
$\beta_t(i)$ can be solved inductively as follows

1. Initialization

$$ \beta_T(i) = 1, \quad 1 \le i \le N \tag{2.27} $$

2. Induction

$$ \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, T-2, \ldots, 1, \quad 1 \le i \le N \tag{2.28} $$

The initialization step arbitrarily defines $\beta_T(i)$ to be 1 for all $i$.
Step 2 shows that in order to be in state $i$ at time $t$, and to account for
the observation sequence from time $t+1$ on, we have to consider all possible
states $j$ at time $t+1$, accounting for the transition from $i$ to $j$ (the
$a_{ij}$ term), as well as the observation $o_{t+1}$ in state $j$ (the
$b_j(o_{t+1})$ term), and then account for the remaining partial observation
sequence from state $j$ (the $\beta_{t+1}(j)$ term). The number of
calculations required in this procedure is again on the order of $N^2 T$.
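A corresponding sketch of the backward procedure is given below. The helper `likelihood_from_beta` is an addition for checking purposes: it uses the identity $P(O \mid \lambda) = \sum_i \pi_i\, b_i(o_1)\, \beta_1(i)$, which follows from the definition (2.26). Function names and the example model are hypothetical, and no scaling is applied.

```python
def backward(A, B, obs):
    """Return beta, where beta[t][i] corresponds to beta_{t+1}(i) in the
    1-based notation."""
    N, T = len(A), len(obs)
    # Initialization (2.27): beta_T(i) = 1 for all i
    beta = [[1.0] * N for _ in range(T)]
    # Induction (2.28), running backwards in time
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N)
            )
    return beta

def likelihood_from_beta(B, pi, obs, beta):
    """P(O|lambda) recovered from the first backward variables."""
    return sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))

# Illustrative model (values are not from the thesis).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
obs = [0, 1, 1]
beta = backward(A, B, obs)
print(likelihood_from_beta(B, pi, obs, beta))
```

The value printed should agree with the result of the forward procedure on the same model and observation sequence, which gives a convenient consistency check.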

2.5.2 Problem 2 - Optimal state sequence

There are several ways of solving problem 2, i.e. of finding the optimal state
sequence associated with the given observation sequence. A formal technique
for finding the single best state sequence is based upon a dynamic programming
method called the Viterbi algorithm.

Viterbi Algorithm (VA)

To find the single best state sequence $q = (q_1, q_2, \ldots, q_T)$ for the
given observation sequence $O = (o_1 o_2 \ldots o_T)$, we define the quantity

$$ \delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \ldots q_{t-1}, q_t = i, o_1 o_2 \ldots o_t \mid \lambda) \tag{2.29} $$

i.e., $\delta_t(i)$ is the best score (highest probability) along a single
path at time $t$, which accounts for the first $t$ observations and ends in
state $i$. The complete procedure for finding the best score is given below

1. Initialization

$$ \delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N \tag{2.30} $$

2. Recursion

$$ \delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 2 \le t \le T, \quad 1 \le j \le N \tag{2.31} $$

3. Termination

$$ P^* = \max_{1 \le i \le N} \left[ \delta_T(i) \right] \tag{2.32} $$

In the Viterbi algorithm, the computation of $\delta_t(i)$ involves $a$ and
$b$ terms which are less than one. The value of $\delta_t(i)$ therefore becomes
smaller and smaller as $t$ grows and can eventually underflow. Hence an
alternative procedure is proposed.
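The numerical problem can be seen in a small experiment. The sketch below multiplies the same probability T times, as a product of a and b terms would, and compares it with the corresponding sum of logarithms; the function name `compare_scores` and the values p = 0.1 and T = 400 are arbitrary choices for illustration.

```python
import math

def compare_scores(p, T):
    """Accumulate T factors of p as a product and as a sum of logarithms."""
    product = 1.0
    log_sum = 0.0
    for _ in range(T):
        product *= p
        log_sum += math.log(p)
    return product, log_sum

product, log_sum = compare_scores(0.1, 400)
print(product)  # 0.0: the true value 1e-400 is below the double-precision range
print(log_sum)  # roughly -921.03, still perfectly representable
```

Once every candidate score has underflowed to zero, the maximization in the recursion can no longer distinguish between paths, whereas the logarithmic scores remain comparable.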


Alternative Viterbi implementation

This uses the logarithms of the model parameters. The following steps are used

1. Pre-processing

$$ \tilde{\pi}_i = \ln(\pi_i), \quad 1 \le i \le N \tag{2.33} $$

$$ \tilde{b}_i(o_t) = \ln[b_i(o_t)], \quad 1 \le i \le N, \quad 1 \le t \le T \tag{2.34} $$

$$ \tilde{a}_{ij} = \ln(a_{ij}), \quad 1 \le i, j \le N \tag{2.35} $$

2. Initialization

$$ \tilde{\delta}_1(i) = \ln(\delta_1(i)) = \tilde{\pi}_i + \tilde{b}_i(o_1), \quad 1 \le i \le N \tag{2.36} $$

3. Recursion

$$ \tilde{\delta}_t(j) = \ln(\delta_t(j)) = \max_{1 \le i \le N} \left[ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij} \right] + \tilde{b}_j(o_t), \quad 2 \le t \le T, \quad 1 \le j \le N \tag{2.37} $$

4. Termination

$$ \tilde{P}^* = \max_{1 \le i \le N} \left[ \tilde{\delta}_T(i) \right] \tag{2.38} $$

The calculations required for this alternative implementation are on the order
of $N^2 T$ additions. Because the pre-processing needs to be performed only
once and its results saved, its cost is negligible for most systems.
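The logarithmic procedure can be sketched as below. The back-pointer bookkeeping that recovers the state sequence is an addition for illustration (the steps above only produce the score), and mapping zero probabilities to minus infinity via `safe_log` is an assumption of this sketch. Function names and the example model are hypothetical; indices start at 0.

```python
import math

NEG_INF = float("-inf")

def safe_log(x):
    """ln(x), with ln(0) represented as minus infinity."""
    return math.log(x) if x > 0.0 else NEG_INF

def viterbi_log(A, B, pi, obs):
    """Return (log of the best path probability, best state sequence)."""
    N = len(pi)
    # Pre-processing (2.33)-(2.35): done once per model
    logA = [[safe_log(a) for a in row] for row in A]
    logB = [[safe_log(b) for b in row] for row in B]
    # Initialization (2.36)
    delta = [safe_log(pi[i]) + logB[i][obs[0]] for i in range(N)]
    backptr = []
    # Recursion (2.37): additions replace the multiplications
    for t in range(1, len(obs)):
        new_delta, ptr = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] + logA[i][j])
            new_delta.append(delta[best_i] + logA[best_i][j] + logB[j][obs[t]])
            ptr.append(best_i)
        delta = new_delta
        backptr.append(ptr)
    # Termination (2.38), followed by back-tracking (not part of the text)
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path

# Illustrative model (values are not from the thesis).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
score, path = viterbi_log(A, B, pi, [0, 1, 1])
print(score, path)
```

In a recognizer that only compares word models, the path can be discarded and only the score kept.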

2.5.3 Problem 3 - Parameter Estimation

This is the most difficult problem of the hidden Markov model. There is no
known way to solve analytically, in closed form, for the model parameter set
that maximizes the probability of the observation sequence. However, the model
$\lambda = (A, B, \pi)$ can be chosen such that its likelihood,
$P(O \mid \lambda)$, is locally maximized using an iterative procedure such as
the Baum-Welch method (also known as the expectation-maximization method). In
speech processing this is often called training, and the given observation
sequence from which the model parameters are obtained is called the training
sequence.

Baum-Welch iterative procedure

Let us define $\xi_t(i, j)$ as

$$ \xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) \tag{2.39} $$

i.e., the probability of a path being in state $i$ at time $t$ and making a
transition to state $j$ at time $t+1$, given the observation sequence and the
model. $\xi_t(i, j)$ can be written as

$$ \xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)} \tag{2.40} $$

Let us define $\gamma_t(i)$ as the probability of state $i$ at time $t$, given
the entire observation sequence and the model. Mathematically

$$ \gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j) \tag{2.41} $$

Hence the re-estimation formulae for A, B and $\pi$ in terms of these
quantities are given as

$$ \bar{\pi}_i = \text{expected frequency (number of times) in state } i \text{ at time } (t = 1) = \gamma_1(i) \tag{2.42} $$

$$ \bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{2.43} $$

$$ \bar{b}_j(k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j} = \frac{\sum_{t=1,\ o_t = v_k}^{T-1} \gamma_t(j)}{\sum_{t=1}^{T-1} \gamma_t(j)} \tag{2.44} $$

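The re-estimation formulae can be derived by maximization of Baum’s auxiliary
function, defined as

$$ Q(\lambda', \lambda) = \sum_{q} P(O, q \mid \lambda') \log P(O, q \mid \lambda) \tag{2.45} $$

over $\lambda$. Because

$$ Q(\lambda', \lambda) \ge Q(\lambda', \lambda') \Rightarrow P(O \mid \lambda) \ge P(O \mid \lambda') \tag{2.46} $$

we can maximize the function $Q(\lambda', \lambda)$ over $\lambda$ to improve
$\lambda'$ in the sense of increasing the likelihood $P(O \mid \lambda)$.
Eventually the likelihood function converges to a critical point if we iterate
the procedure.

In terms of the forward and backward variables, the re-estimation formulae are
hence represented as

$$ \bar{\pi}_i = \frac{\alpha_1(i)\, \beta_1(i)}{\sum_{j=1}^{N} \alpha_T(j)} = \gamma_1(i) \tag{2.47} $$

$$ \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)} \tag{2.48} $$

$$ \bar{b}_j(k) = \frac{\sum_{t=1,\ o_t = v_k}^{T-1} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T-1} \alpha_t(j)\, \beta_t(j)} \tag{2.49} $$
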
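One pass of the re-estimation can be sketched as follows, under several stated assumptions: a single training sequence of length at least 2, strictly positive model parameters (so that no denominator vanishes), no scaling, and all sums taken over t = 1, ..., T-1 as in the formulae of this section (many references extend the sums for the B update to t = T). The function names and the example model are hypothetical; indices start at 0.

```python
def forward(A, B, pi, obs):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha

def backward(A, B, obs):
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def reestimate(A, B, pi, obs):
    """One Baum-Welch style update; returns (new_A, new_B, new_pi)."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    # xi[t][i][j]: probability of being in i at t and j at t+1
    xi = []
    for t in range(T - 1):
        raw = [[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(N)] for i in range(N)]
        total = sum(sum(row) for row in raw)  # equals P(O|lambda)
        xi.append([[v / total for v in row] for row in raw])
    # gamma[t][i]: sum of xi[t][i][j] over j
    gamma = [[sum(xi[t][i]) for i in range(N)] for t in range(T - 1)]
    occupancy = [sum(gamma[t][i] for t in range(T - 1)) for i in range(N)]
    new_pi = list(gamma[0])
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) / occupancy[i]
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T - 1) if obs[t] == k) / occupancy[j]
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi

# Illustrative model and training sequence (values are not from the thesis).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
new_A, new_B, new_pi = reestimate(A, B, pi, [0, 1, 1, 0, 1])
print(new_A, new_B, new_pi)
```

In training, the update would be repeated, feeding the new parameters back in, until the likelihood stops improving; with several training utterances per word the numerators and denominators are accumulated over all utterances before dividing.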

Remark

Successful application of hidden Markov model methods usually involves the
following steps:

1. Define the set of L sound classes for modeling, such as phonemes or words.
These sound classes are represented as $V = \{v_1, v_2, \ldots, v_L\}$.

2. For each class, collect a sizable set (the training set) of labeled
utterances that are known to be in that class.

3. Based on each training set, solve the estimation problem to obtain the best
model $\lambda_i$ for each class $v_i$, $i = 1, 2, \ldots, L$.

4. During recognition, evaluate $P(O \mid \lambda_i)$, $i = 1, 2, \ldots, L$,
for the unknown utterance and identify the speech that produced the
observation sequence O as class $v_j$ if

$$ P(O \mid \lambda_j) = \max_{1 \le i \le L} P(O \mid \lambda_i). $$

Here we have briefly introduced the steps taken during the implementation of
an HMM. The detailed procedure is discussed in the next chapter.





Chapter 3 


Implementation of HMM for 
isolated word recognition 


In the previous chapter the probabilistic framework of the hidden Markov model
was discussed in detail. The complete HMM is represented as
$\lambda = (A, B, \pi)$. The output distribution $\{b_j(k)\}$ models the
parametric distribution of speech events and the transition distribution
$\{a_{ij}\}$ models the duration of these events. How the concept of the HMM
is applied to an isolated word recognition system is discussed in this
chapter. First the strategy of processing is explained theoretically and then
the implementation of the HMM is explained.


3.1 Strategy Of Processing 

The speech signal is a slowly time-varying signal in the sense that, when
examined over a sufficiently short duration of time, its characteristics are
fairly stationary. Therefore it can be represented by linear time invariant
models such as autoregressive (AR) models, linear prediction, the short time
Fourier transform (STFT) or the wavelet transform (WT). In this work the linear
prediction model is considered. LPC provides a good model of the speech signal
and works well in recognition. However, over a long duration corresponding to
a word (from some hundred milliseconds to about one second), the speech signal
is time-variant in both amplitude and frequency content. The speech signal
corresponding to the whole duration of a word is therefore divided into
several segments, each several tens of milliseconds long. Each such segment is
called a frame. Each frame is represented by a characteristic vector called
the observation vector $O_t$. Here, cepstral coefficients are used to
represent a frame.

For the discrete HMM we have to design a codebook consisting of a finite set
of distinct vectors (symbols). The inputs to the isolated word recognition
system based on the discrete HMM are sequences of discrete symbols chosen from
the codebook. To determine the entries of the codebook we proceed as follows.
The vector corresponding to each frame of a word is obtained. These vectors
are grouped into a set. Similarly we determine a set of vectors for each word
in the vocabulary. These sets of vectors are combined to give a set of
training vectors. Let the set $\{o_i,\ i = 1, 2, \ldots, I\}$ denote the
training vectors that occurred when the words in the vocabulary were
pronounced. Using the k-means clustering algorithm we partition the vectors
$o_i$ into M different cells. The centroid of each cell is then computed such
that the average distortion in replacing each of the training vectors $o_i$ by
the closest centroid is minimized. Each centroid so obtained is registered in
the codebook. Once the codebook is designed, the mapping between
continuous vectors and codebook indices is done by the nearest distance rule.
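The codebook design and the nearest distance rule can be sketched as follows. Squared Euclidean distance is assumed as the distortion measure, the initial centroids are drawn at random from the training vectors, and a fixed number of iterations is used; none of these choices is specified in the text, and all function names are hypothetical.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance (assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(codebook, vec):
    """Nearest distance rule: index of the closest codebook entry."""
    return min(range(len(codebook)), key=lambda m: sq_dist(codebook[m], vec))

def train_codebook(vectors, M, iterations=20, rng=None):
    """Plain k-means: alternate cell assignment and centroid update."""
    rng = rng or random.Random(0)
    codebook = [list(v) for v in rng.sample(vectors, M)]
    for _ in range(iterations):
        cells = [[] for _ in range(M)]
        for v in vectors:
            cells[nearest(codebook, v)].append(v)
        for m, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous centroid
                dim = len(cell[0])
                codebook[m] = [sum(v[d] for v in cell) / len(cell)
                               for d in range(dim)]
    return codebook

def quantize(codebook, sequence):
    """Map a sequence of feature vectors to a sequence of codebook indices."""
    return [nearest(codebook, v) for v in sequence]

# Toy one-dimensional "feature vectors" forming two obvious groups.
data = [[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]]
codebook = train_codebook(data, M=2)
print(codebook)
print(quantize(codebook, data))
```

The output of `quantize` is the discrete observation sequence that is passed to the word HMMs.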

Figure 3.1: The strategy of processing

The hidden Markov model is used to model the transitions between the frames,
which form a non-stationary process. During the training phase, different HMMs
are defined for different words in the vocabulary. In this work six HMMs are
defined. During recognition, the Viterbi algorithm [8, 7] is used to determine
the HMM within the set of HMMs that best matches the observation sequence: the
likelihood score of the most likely state sequence is computed for each HMM,
and the HMM with the highest likelihood score is selected as the one that best
matches the observation.
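The decision rule can be sketched as below: each word model is scored by the log-probability of its most likely state sequence and the best-scoring word is returned. The dictionary layout, the function names and the two toy word models are assumptions made for illustration; the thesis itself uses six word models whose parameters are not reproduced here.

```python
import math

def log_viterbi_score(A, B, pi, obs):
    """Log-probability of the most likely state sequence (score only)."""
    log = lambda x: math.log(x) if x > 0.0 else float("-inf")
    N = len(pi)
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(N)]
    for o in obs[1:]:
        delta = [max(delta[i] + log(A[i][j]) for i in range(N)) + log(B[j][o])
                 for j in range(N)]
    return max(delta)

def recognize(models, obs):
    """Return (best word, all scores) for a dict word -> (A, B, pi)."""
    scores = {word: log_viterbi_score(A, B, pi, obs)
              for word, (A, B, pi) in models.items()}
    return max(scores, key=scores.get), scores

# Two hypothetical left-right word models over a 2-symbol codebook.
left_right_A = [[0.6, 0.4], [0.0, 1.0]]
models = {
    "word_a": (left_right_A, [[0.9, 0.1], [0.8, 0.2]], [1.0, 0.0]),
    "word_b": (left_right_A, [[0.1, 0.9], [0.2, 0.8]], [1.0, 0.0]),
}
word, scores = recognize(models, [0, 0, 0, 1])
print(word, scores)
```

Since every model is scored on the same observation sequence, only the ordering of the scores matters, which is why the logarithmic scores can be compared directly.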

The strategy of processing is illustrated in fig 3.1. The steps used are the
following.

• Front end processing, 

• Clustering, 

• Training of HMM, 

• Recognition 

These steps are elaborated in next sections. 

The speech signal contains many frequency components. It is necessary to
detect the speech in the presence of noise. By accurately detecting the
beginning and end of the utterance, the amount of speech data to be processed
can be kept to a minimum. This problem is solved by the end point detector.


3.2 End point detector 

The objective of speech detection is to separate the acoustic events of
interest in a continuously recorded signal from the other parts of the signal.
The end point detector [14, 15] must have the following features.

• Simple and efficient processing. 

• Reliable location of significant acoustic events. 

• Capability of being applied to varying background silences. 

The algorithm proposed for locating the endpoints of an utterance is based on
two measures of the signal: zero crossing rate and energy. The speech “energy”
is defined as the sum of the magnitudes of 10 ms of speech centered on the
measurement interval, i.e.

$$ E(n) = \sum_{i=-50}^{50} |s(n+i)| \tag{3.1} $$

where $s(n)$ are the speech samples and the sampling frequency is assumed to
be 10 kHz. The following steps are used in this algorithm.

1. Obtain the maximum energy $E_{max}$ for the silent portion.

2. Compare the energy of each frame (10 ms duration) of the recorded data with $E_{max}$. If the energy of a frame is greater than $E_{max}$, label that point as the starting point of the word.

3. Continue comparing the energy of each subsequent frame with $E_{max}$ for as long as $E_{max}$ is less than the energy of the frame. When $E_{max}$ becomes greater, label that point as the end point.

Figure 3.2: Front-end processor


By means of the end point algorithm, the background noise before and after the word is reduced to an acceptable level. After the end point detector, the data is available for further processing, i.e. for determining the observation vectors.
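
The three steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names, the use of non-overlapping 100-sample frames (10 ms at 10 kHz), and the choice of the first ten frames as the "silent portion" are not taken from the thesis. Only the energy measure is used; the zero crossing rate is not included.

```python
import numpy as np

def frame_energy(s, frame_len=100):
    """Sum of magnitudes per non-overlapping frame (100 samples = 10 ms at 10 kHz)."""
    n_frames = len(s) // frame_len
    frames = np.abs(s[: n_frames * frame_len]).reshape(n_frames, frame_len)
    return frames.sum(axis=1)

def detect_endpoints(s, silence_frames=10, frame_len=100):
    """Return (start_sample, end_sample), or None if no frame exceeds E_max.

    Assumption: the first `silence_frames` frames contain only background noise.
    """
    energy = frame_energy(np.asarray(s, dtype=float), frame_len)
    e_max = energy[:silence_frames].max()        # step 1: peak energy of the silence
    above = np.nonzero(energy > e_max)[0]
    if above.size == 0:
        return None
    start = int(above[0])                        # step 2: first frame above E_max
    end = start
    while end < len(energy) and energy[end] > e_max:
        end += 1                                 # step 3: advance until energy drops
    return start * frame_len, end * frame_len
```

The returned pair gives sample indices, so `s[start:end]` is the segment that would be passed on to the front-end processor.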


3.3 LPC Feature Analysis 

An LPC-based front-end processor (see Appendix B) is used in this work for the implementation of the speech recognition system. Fig. 3.2 shows the block diagram of the LPC processor [2, 4].

The above system is a block processing model in which a frame of N samples is processed and a vector of features $O_t$ is computed. The following steps are involved in the processing:


1. Pre-emphasis- The speech signal is digitized at a sampling rate of 10 kHz and processed by a first order digital network in order to spectrally flatten the signal. The most widely used pre-emphasis network is

$$H(z) = 1 - a z^{-1}, \qquad 0.9 \le a \le 1.0 \tag{3.2}$$

In this case the output $\tilde{s}(n)$ is related to the input to the network, $s(n)$, by the difference equation

$$\tilde{s}(n) = s(n) - a\, s(n-1). \tag{3.3}$$

Here the value of a is taken to be 0.95.

2. Frame Blocking- In this step the pre-emphasized speech signal, $\tilde{s}(n)$, is blocked into frames of N samples, with adjacent frames being separated by M samples. Here N is taken as 256 and consecutive frames are separated by 85 samples. If $M < N$ then adjacent frames overlap and the resulting LPC spectral estimates will be correlated from frame to frame; if $M \ll N$ then the LPC spectral estimates from frame to frame will be quite smooth. On the other hand, if $M > N$ there will be no overlap between adjacent frames. Let the $l$th frame of speech be denoted by $x_l(n)$, and let there be L frames within the entire speech signal; then

$$x_l(n) = \tilde{s}(Ml + n), \qquad n = 0, 1, \ldots, N-1, \quad l = 0, 1, \ldots, L-1 \tag{3.4}$$

3. Frame Windowing- Each frame is windowed, i.e. multiplied by a window, so as to minimize the signal discontinuities at the beginning and end of each frame. By this means the signal is tapered toward zero at the beginning and end of each frame. Let $w(n)$, $0 \le n \le N-1$, denote the window; then the result of windowing is the signal

$$\tilde{x}_l(n) = x_l(n)\, w(n), \qquad 0 \le n \le N-1 \tag{3.5}$$

Here the Hamming window is used. It is of the form

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \tag{3.6}$$

4. Autocorrelation Analysis- Each frame of the windowed signal is next autocorrelated to give

$$r_l(m) = \sum_{n=0}^{N-1-m} \tilde{x}_l(n)\, \tilde{x}_l(n+m), \qquad m = 0, 1, \ldots, p \tag{3.7}$$

where the highest autocorrelation lag, p, is the order of the LPC analysis. Here the value of p is taken as 10.

5. LPC Analysis- For each frame, a vector of LPC coefficients is computed. The formal method for converting from autocorrelation coefficients to an LPC parameter set (for the LPC autocorrelation method) is known as Durbin's method (see Appendix B). An LPC-derived cepstral vector is then computed up to the Qth component, where $Q > p$; Q = 12 is used here.

6. LPC Parameter Conversion to Cepstral Coefficients- A very important LPC parameter set, which can be derived directly from the LPC coefficient set, is the set of LPC cepstral coefficients, $c(m)$. The recursion used starts from

$$c_0 = \ln \sigma^2 \tag{3.8}$$

where $\sigma^2$ is the gain term in the LPC model. The cepstral coefficients are a more reliable feature set for speech recognition than the LPC coefficients. Generally the value of Q is taken as 12.
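
Steps 1 to 4 above (pre-emphasis, frame blocking, Hamming windowing and autocorrelation) can be sketched with NumPy as follows. The function names are hypothetical, s(-1) is assumed to be zero, and trailing samples that do not fill a complete frame are simply dropped; none of these choices is stated in the thesis.

```python
import numpy as np

def preemphasize(s, a=0.95):
    """Eq. (3.3): s~(n) = s(n) - a*s(n-1), with s(-1) assumed to be 0."""
    s = np.asarray(s, dtype=float)
    out = s.copy()
    out[1:] -= a * s[:-1]
    return out

def block_frames(s, N=256, M=85):
    """Eq. (3.4): x_l(n) = s~(M*l + n); only complete frames are kept."""
    L = 1 + (len(s) - N) // M
    idx = M * np.arange(L)[:, None] + np.arange(N)[None, :]
    return s[idx]

def hamming(N):
    """Eq. (3.6): w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def autocorrelate(frames, p=10):
    """Eq. (3.7): r_l(m) = sum_n x~_l(n) x~_l(n+m) for m = 0..p, one row per frame."""
    N = frames.shape[1]
    cols = [(frames[:, : N - m] * frames[:, m:]).sum(axis=1) for m in range(p + 1)]
    return np.stack(cols, axis=1)

def front_end_autocorr(s, a=0.95, N=256, M=85, p=10):
    """Run steps 1-4 and return an (L, p+1) array of autocorrelation values."""
    frames = block_frames(preemphasize(s, a), N, M) * hamming(N)
    return autocorrelate(frames, p)
```

With N = 256 and M = 85, a segment of 256 + 22 x 85 = 2126 samples would yield 23 complete frames, each contributing p + 1 = 11 autocorrelation values.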


The complete LPC analysis is shown in Fig. 3.2. After performing the LPC analysis frame by frame, we obtain a set of observation vectors, each of dimension 12. In this work we have obtained 23 vectors, each of dimension 12, for each vowel, so the total number of vectors obtained is 23 x 6 = 138. These vectors are grouped into six clusters, each represented by one codebook vector.
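
The conversion from autocorrelation values to cepstral coefficients (steps 5 and 6 of the front-end processing) can be sketched as below. This is a minimal sketch, not the thesis code: the Levinson-Durbin recursion is written for the predictor convention s(n) ≈ Σ a_k s(n-k), the gain term σ² is assumed to be the final prediction error energy, and the recursion for c_m with m ≥ 1 is the standard LPC-to-cepstrum recursion, assumed here because only c_0 is given explicitly in the text.

```python
import numpy as np

def levinson_durbin(r, p):
    """Return predictor coefficients a_1..a_p and the final prediction error energy."""
    a = np.zeros(p + 1)               # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        # reflection coefficient k_i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - k * k
    return a[1:], err

def lpc_to_cepstrum(a, gain_sq, Q=12):
    """Return c_0..c_Q; c_0 = ln(sigma^2), the rest by the standard recursion (assumed)."""
    p = len(a)
    c = np.zeros(Q + 1)
    c[0] = np.log(gain_sq)
    for m in range(1, Q + 1):
        acc = a[m - 1] if m <= p else 0.0          # a_m term exists only for m <= p
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]   # (k/m) * c_k * a_{m-k}
        c[m] = acc
    return c
```

Applied to each row of autocorrelation values with p = 10 and Q = 12, the entries c_1 to c_12 would give one 12-dimensional observation vector per frame; whether c_0 is kept as a feature is a design choice that the text does not specify.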


3.4 K-Mean Clustering 

The purpose of vector quantization is to design a codebook containing a finite set of distinct vectors. After the LPC analysis we obtain a set $V = \{v_1, v_2, \ldots, v_L\}$ containing L training vectors; here a total of 138 vectors have been obtained. We want to classify these vectors into M different groups (or cells) and to determine for each cell $C_j$ a centroid $c_j$ representing that cell. The centroid $c_j$ must be such that the average distortion in cell $C_j$ is minimum, i.e. it must minimize $E[d(v_i, c_j) \mid v_i \in C_j]$, where $v_i$ denotes a vector belonging to cell $C_j$.

With the help of the K-mean algorithm [10, 16] we determine $c_j$ for each $C_j$ by minimizing the average distortion defined by

$$D = \frac{1}{L} \sum_{i=1}^{L} d(v_i, c_{v_i}) \tag{3.11}$$

where $v_i$ denotes a vector from the set V and $c_{v_i}$ denotes the centroid of the cell $C_j$ to which vector $v_i$ belongs. The distortion measure used here is

$$d(v_i, c_{v_i}) = (v_i - c_{v_i})^T (v_i - c_{v_i}). \tag{3.12}$$

The centroid of the cell $C_j$ is calculated as:

$$c_j = \frac{1}{N_j} \sum_{n=1}^{N_j} v_n, \qquad v_n \in C_j \tag{3.13}$$

where $N_j$ is the total number of vectors in cell $C_j$.

The procedure is described in the following steps.

1. Initially partition the set of L vectors into M different cells arbitrarily. Determine the centroid $c_j$ of each cell as

$$c_j = \frac{1}{N_j} \sum_{n=1}^{N_j} v_n, \qquad 1 \le j \le M \tag{3.14}$$

2. Proceed through the set of L vectors and assign each vector $v_i$ to the cell $C_j$ whose centroid $c_j$ is nearest to it. Mathematically, calculate

$$d_j = d(v_i, c_j), \qquad 1 \le j \le M \tag{3.15}$$

and transfer $v_i$ to the cell $C_j$ for which $d_j$ is minimum.

3. Recalculate the centroid for the cell receiving the new vector and for the cell losing the vector.

4. Repeat steps 2 and 3 until no more reassignments take place.

In this way a codebook of the desired size is obtained. Then, during training and recognition, each continuous vector is assigned the index of the nearest codebook vector.
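
A compact sketch of this clustering procedure and of the nearest-distance quantization rule is given below. It is a batch variant: all centroids are recomputed after each full pass over the vectors rather than after every individual transfer as in step 3, and an empty cell is re-seeded from a random training vector. The function names, the random initial partition and the `seed` parameter are assumptions made for illustration.

```python
import numpy as np

def kmeans_codebook(V, M=6, seed=0, max_iter=100):
    """Cluster the rows of V into M cells; return (centroids, labels)."""
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    labels = rng.permutation(len(V)) % M          # arbitrary initial partition
    for _ in range(max_iter):
        # centroid of each cell, eq. (3.14); re-seed a cell if it became empty
        centroids = np.stack([
            V[labels == j].mean(axis=0) if np.any(labels == j)
            else V[rng.integers(len(V))]
            for j in range(M)
        ])
        # squared Euclidean distortion, eq. (3.12), to every centroid
        d = ((V[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # no reassignments: converged
            break
        labels = new_labels
    return centroids, labels

def quantize(vectors, centroids):
    """Map each vector to the index of the nearest codebook vector."""
    X = np.asarray(vectors, dtype=float)
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```

For the 138 twelve-dimensional cepstral vectors, `kmeans_codebook(V, M=6)` would return a six-entry codebook, and `quantize` would turn the vectors of one utterance into the index sequence used by the HMMs.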


3.5 Block diagram of isolated word recognition 

Consider the block diagram (Fig. 3.3) for understanding the concept of the isolated word recognition system. The word for which an HMM is to be designed is uttered repeatedly, and the training set of LPC-derived cepstral vectors is obtained by the front end processor. These observation vectors are compared one by one with each vector in the codebook and are assigned the index of the nearest codebook vector. In this way we obtain the observation sequence (index sequence) $O = \{o_1, o_2, \ldots, o_T\}$ which is used in the training of the HMM. During training the Baum-Welch re-estimation algorithm is used. The starting conditions used for the re-estimation algorithm are

$a_{ij} = 1/N$ for all i, j, where N = total number of states,

$b_{jk} = 1/M$ for all j, k, where M = total number of vectors in the codebook.

Hence after training we obtain an HMM representing that word. Similarly we design different HMMs representing different words.

During recognition, the switch is changed to mode 2. We determine the observation sequence for the test word, and the probability score is calculated using the Viterbi algorithm.

The complete isolated word recognition model is shown in Fig. 3.4. We have a vocabulary of six words to be recognized and each word is modeled by a distinct HMM. For each word in the vocabulary we have a training set of K occurrences of the spoken word, where each occurrence of the word constitutes an observation sequence. The observations are some appropriate representation of the characteristics of the word. In order to do isolated word recognition, we must perform the following:

1. For each word v in the vocabulary, we must build an HMM $\lambda^v$, i.e. we must estimate the model parameters $(A, B, \pi)$ that optimize the likelihood of the training set observation vectors of the vth word.

2. For each unknown word which is to be recognized, the processing described above is carried out and the probability $P(O \mid \lambda^v)$ is calculated for each model; the HMM is then selected according to

$$v^* = \arg\max_{1 \le v \le V} \left[ P(O \mid \lambda^v) \right] \tag{3.16}$$

The probability calculation is performed using the Viterbi algorithm and requires on the order of $V \cdot N^2 \cdot T$ computations.

Figure 3.3: Conceptual Model for isolated word recognition

Figure 3.4: The complete model for isolated word recognition


3.6 Applying the VA to HMMs


Given the observation sequence, the VA finds the most likely state sequence in a given HMM and the likelihood associated with this most likely sequence. At the beginning of each search, each state in a given HMM has some predefined likelihood score. To find the most likely transition and update the state likelihood score for each state at any time instant, one must find the most likely transition coming into a given state. This is done by adding the log transition probabilities of all the transitions coming into this given state to their corresponding previous state likelihood scores, and selecting the transition with the maximum sum to be the most likely transition coming into this state. This sum is then added to the log observation probability assigned by the given state to the current observation symbol, in order to form the updated likelihood score for this given state. The most likely transition and the state likelihood score are updated in this way for all states, and the process is performed recursively until all symbols in the given observation sequence are processed. At the end of the recursion, the path associated with the state that has the highest likelihood score is selected as the most likely state sequence for the given observation. That is, if the observation sequence is $(o_1, o_2, \ldots, o_T)$, then for state j at recursion t, one wants to compute

$$\delta_t(j) = \max_{1 \le i \le N} \left\{ \delta_{t-1}(i) + \ln(a_{ij}) + \ln(b_j(o_t)) \right\} \tag{3.17}$$

where $b_j(o_t)$ is the observation probability for $o_t$ assigned by state j at recursion t, $a_{ij}$ is the transition probability of the transition from state i to state j, and $\delta_t(j)$ is the state likelihood score for state j at recursion t.

At the end of the recursion, the path that has the highest likelihood score is selected to be the most likely path or state sequence. For determining the most likely state sequence, the most likely transition for each state is registered at each recursion so that the most likely state sequence can be traced at the end of the recursion.
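
The recursion (3.17), the backtracking step and the selection rule (3.16) can be sketched as follows. The sketch assumes that the initial score of state j is ln π_j + ln b_j(o_1), that the observation matrix is indexed as state by symbol, and that each word model is passed as a tuple of log-domain arrays; the function names are hypothetical.

```python
import numpy as np

def to_log(p):
    """Natural log of a probability array; zeros map to -inf without a warning."""
    with np.errstate(divide="ignore"):
        return np.log(np.asarray(p, dtype=float))

def viterbi_log(obs, log_pi, log_A, log_B):
    """Return (best log-likelihood score, most likely state sequence).

    log_A[i, j] = ln a_ij, log_B[j, k] = ln b_j(k), obs = list of symbol indices.
    """
    T, N = len(obs), len(log_pi)
    delta = log_pi + log_B[:, obs[0]]             # assumed initial scores
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A           # scores[i, j] = delta_{t-1}(i) + ln a_ij
        back[t] = scores.argmax(axis=0)           # register the best transition into j
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    state = int(delta.argmax())
    path = [state]
    for t in range(T - 1, 0, -1):                 # trace the registered transitions back
        state = int(back[t, state])
        path.append(state)
    return float(delta.max()), path[::-1]

def recognize(obs, models):
    """Score obs against every (log_pi, log_A, log_B) model and pick the best one."""
    scores = [viterbi_log(obs, *m)[0] for m in models]
    return int(np.argmax(scores)), scores
```

Each call to `viterbi_log` costs on the order of N² T additions and comparisons, so scoring all V word models gives the V · N² · T figure quoted earlier.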

Applying the Viterbi algorithm to calculate the likelihood score provides the following advantages. Viterbi-based search is efficient: if we use a logarithmic representation of the HMM probabilities, the multiplications are reduced to additions. It is also possible to obtain the state sequence with the Viterbi algorithm.


Chapter 4


Results And Discussion


The procedure for the implementation of a speech recognition system based on the hidden Markov model is explained in the previous chapter. The system has been designed for the recognition of six vowels. These vowels are e (bet), A (but), a (hot), I (bit), O (obey) and U (food). Six discrete HMMs are trained, one for each vowel.

These vowels are recorded in a noise-free environment using a sampling rate of 10 kHz. The speech signals for the vowels are shown in Figs. 4.1-4.6. LPC is used for determining the spectral properties of the signal. According to this algorithm the speech samples are first pre-emphasized by using the first order filter $1 - 0.95 z^{-1}$. The pre-emphasized signal is then blocked into a number of small frames. The frame size is taken as 256 samples. Consecutive frames are spaced 85 samples apart, so adjacent frames overlap by 171 samples. Each frame is windowed by using the Hamming window given as

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \tag{4.1}$$

LPC coefficients are computed for each frame and then converted into cepstral coefficients. All the vectors obtained after front end processing are clustered into six cells by using the K-mean algorithm and the centroid of each cell is computed. Table 4.1 shows the centroid of each cell. The model parameters are computed by using the Baum-Welch algorithm. The starting conditions used in the re-estimation algorithm are:

1. $\pi_i = 1/N$, $1 \le i \le N$;

2. $a_{ij} = 1/N$, $1 \le i, j \le N$;

3. $b_{ik} = 1/M$, $1 \le i \le N$, $1 \le k \le M$.

The following constraints are applied on $b_{ik}$:

1. if $b_{ik} < 1 \times 10^{-5}$ then $b_{ik} = 1 \times 10^{-5}$;

2. $\sum_{k=1}^{M} b_{ik} = 1$.

The results obtained are given in Tables 4.2-4.7.
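
One simple way to impose both constraints on the observation matrix after a re-estimation pass is to floor the entries and then rescale each row, as sketched below. This is an assumed procedure, since the thesis does not say how the two constraints are combined; the function name is hypothetical and the matrix is taken to be indexed as state by symbol.

```python
import numpy as np

def floor_and_normalize(B, eps=1e-5):
    """Floor every b_ik at eps, then rescale each row so that it sums to one."""
    B = np.maximum(np.asarray(B, dtype=float), eps)
    return B / B.sum(axis=1, keepdims=True)
```

For a row in which a single symbol out of M = 6 carries all the probability, this gives about 0.99995 for that symbol and about 1e-5 for each of the others. After rescaling, the floored entries fall marginally below `eps`; whether that is acceptable depends on how strictly the first constraint is meant.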

While determining the parameters of the left-right HMM (discussed in Chapter 2), the following constraints are applied on the state transition matrix and on the initial state distribution in addition to the above constraints.

1. $a_{ij} = 0$ for $j < i$;

2. $\pi_i = 0$ for $i \ne 1$, and $\pi_i = 1$ for $i = 1$.

The calculated values of the parameters are given in Tables 4.8-4.14.

Cell no.   Centroid of cell (cepstral coefficient values)

   1        0.831882    -4.642482     3.991625    -4.065114
            4.834633    -2.659011     0.830542    -3.114954
           -0.816965    -1.441938    -1.811512     0.605486

   2        0.756187    -1.161783     2.945668    -6.533916
           -3.190974    -2.963116     2.925372    -1.242985
           -1.590204    -1.624079    -1.095897     0.04612

   3        0.194863    -0.6100089    3.977964    -5.893832
           -3.19.974    -4.490866     4.556600    -0.298862
           -1.108746    -2.089710    -0.991159    -0.18204

   4       -0.50897     -3.796737     2.486141     1.650537
            6.558616     1.119739     0.634291    -3.690248
           -1.753764    -1.398450     0.706641     0.072938

   5       (illegible)   0.707063     3.137482    -3.736169
           (illegible)  -2.343838     3.506834     1.775789
           (illegible)  -1.925398    -0.467580    -0.571358

   6        0.883796     1.288458     2.260686    -0.589854
           -3.246550    -1.579590    -0.713658     2.495806
            0.431527    -1.523267     0.443054    -0.019359

Table 4.1: Output of clustering algorithm


$\pi_i$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$a_{ij}$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$b_{ik}$ =
  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.2: Reestimation model parameters for vowel “e (bet)”



$\pi_i$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$a_{ij}$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.3: Reestimation model parameters for vowel “A (but)”


$\pi_i$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$a_{ij}$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.4: Reestimation model parameters for vowel “a (hot)”


$\pi_i$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$a_{ij}$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.5: Reestimation model parameters for vowel “I (bit)”


$\pi_i$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$a_{ij}$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.6: Reestimation model parameters for vowel “O (obey)”


$\pi_i$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$a_{ij}$ =
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01
  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01  2.00000e-01

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00  0.99995e-00

Table 4.7: Reestimation model parameters for vowel “U (food)”



$\pi_i$ =
  1.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00

$a_{ij}$ =
  9.54210e-01  2.61651e-02  9.81194e-02  5.72363e-03  4.08830e-03
  0.00000e-00  5.71423e-01  2.14882e-01  1.25001e-01  8.92867e-02
  0.00000e-00  0.00000e-00  4.99993e-01  2.91666e-01  2.08333e-01
  0.00000e-00  0.00000e-00  0.00000e-00  5.83333e-01  4.16667e-01
  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  1.00000e-00

$b_{ik}$ =
  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.8: Reestimation model parameters for vowel “e (bet)”


$\pi_i$ =
  1.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00

$a_{ij}$ =
  9.54210e-01  2.61651e-02  9.81194e-02  5.72363e-03  4.08830e-03
  0.00000e-00  5.71423e-01  2.14882e-01  1.25001e-01  8.92867e-02
  0.00000e-00  0.00000e-00  4.99993e-01  2.91666e-01  2.08333e-01
  0.00000e-00  0.00000e-00  0.00000e-00  5.83333e-01  4.16667e-01
  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  1.00000e-00

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.9: Reestimation model parameters for vowel “A (but)”



$\pi_i$ =
  1.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00

$a_{ij}$ =
  9.54210e-01  2.61651e-02  9.81194e-02  5.72363e-03  4.08830e-03
  0.00000e-00  5.71423e-01  2.14882e-01  1.25001e-01  8.92867e-02
  0.00000e-00  0.00000e-00  4.99993e-01  2.91666e-01  2.08333e-01
  0.00000e-00  0.00000e-00  0.00000e-00  5.83333e-01  4.16667e-01
  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  1.00000e-00

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.10: Reestimation model parameters for vowel “a (hot)”



$\pi_i$ =
  1.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00

$a_{ij}$ =
  9.54210e-01  2.61651e-02  9.81194e-02  5.72363e-03  4.08830e-03
  0.00000e-00  5.71423e-01  2.14882e-01  1.25001e-01  8.92867e-02
  0.00000e-00  0.00000e-00  4.99993e-01  2.91666e-01  2.08333e-01
  0.00000e-00  0.00000e-00  0.00000e-00  5.83333e-01  4.16667e-01
  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  1.00000e-00

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.11: Reestimation model parameters for vowel “I (bit)”


$\pi_i$ =
  1.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00

$a_{ij}$ =
  9.54210e-01  2.61651e-02  9.81194e-02  5.72363e-03  4.08830e-03
  0.00000e-00  5.71423e-01  2.14882e-01  1.25001e-01  8.92867e-02
  0.00000e-00  0.00000e-00  4.99993e-01  2.91666e-01  2.08333e-01
  0.00000e-00  0.00000e-00  0.00000e-00  5.83333e-01  4.16667e-01
  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  1.00000e-00

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05

Table 4.12: Reestimation model parameters for vowel “O (obey)”


$\pi_i$ =
  1.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00

$a_{ij}$ =
  9.54210e-01  2.61651e-02  9.81194e-02  5.72363e-03  4.08830e-03
  0.00000e-00  5.71423e-01  2.14882e-01  1.25001e-01  8.92867e-02
  0.00000e-00  0.00000e-00  4.99993e-01  2.91666e-01  2.08333e-01
  0.00000e-00  0.00000e-00  0.00000e-00  5.83333e-01  4.16667e-01
  0.00000e-00  0.00000e-00  0.00000e-00  0.00000e-00  1.00000e-00

$b_{ik}$ =
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05  1.00000e-05
  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01  9.99995e-01

Table 4.13: Reestimation model parameters for vowel “U (food)”


Conclusion:


An isolated word recognition system based on HMMs is implemented for the recognition of six vowels (namely e (bet), A (but), a (hot), I (bit), O (obey), U (food)). While simulating the system, it is observed that the most important part is the estimation of the model parameters, since the performance of the system depends upon their values. The computed values of the model parameters give satisfactory results: approximately 90% recognition accuracy has been obtained. The performance of the system can be further enhanced by using a larger training set, because the HMM estimation algorithms give better estimates of the HMM parameters for large training data. But recording and digitization of speech is a lengthy process, hence we have restricted our work to a small training set. The model parameters determined for the left-right type HMM give an accuracy of about 92.45%.

The greater the number of states, the finer the details of speech that can be modeled. Five states are sufficient for our purpose; more than five states do not lead to a significant improvement in performance [6]. After analyzing the state symbol probability matrix, it is found that each cluster represents one specific vowel.


Appendix A 


Dynamic Time Warping Algorithm 


In a speech recognition system based on the pattern comparison technique, the similarity measurement between the reference pattern and the test pattern is an important issue. In such a system, time registration of the test pattern and the reference pattern is important because their time scales are generally not perfectly aligned: the duration of the test pattern differs from the duration of the reference pattern. Hence the importance of DTW lies in the compression or expansion of the test pattern so that the similarity can be measured effectively. The mathematical framework for the general DTW algorithm is explained here.


Specification Of The DTW Algorithm 

It is assumed that the end points of the test pattern and the reference pattern are known before the DTW algorithm is applied. Let the test pattern be denoted as

$$T = \{T(1), T(2), \ldots, T(M)\} \tag{A.1}$$

where $T(m)$ is the spectral vector of the input speech at time m and M is the total number of frames of speech. Similarly, consider a set of reference patterns $\{R^1, R^2, \ldots, R^V\}$, where each reference pattern $R^j$ is denoted as

$$R^j = \{R(1), R(2), \ldots, R(N)\} \tag{A.2}$$

where $R(n)$ is the spectral vector at time n and N is the total number of frames of the speech pattern.

The purpose of the DTW algorithm is to find the path

$$m = w(n) \tag{A.3}$$

in the (n, m) plane which is optimal, i.e. which minimizes the total distance function D. The function D is given by the following formula:

$$D = \sum_{n=1}^{N} d(R(n), T(w(n))) \tag{A.4}$$

where $d(R(n), T(w(n)))$ is the local distance between frame n of the reference pattern and frame $m = w(n)$ of the test pattern.

Let us express both time axes (n, m) in terms of a common time axis k as

$$n = i(k), \qquad k = 1, 2, \ldots, K \tag{A.5}$$

$$m = j(k), \qquad k = 1, 2, \ldots, K \tag{A.6}$$

where K is the length of the common time axis. In order to find the best path in the (n, m) plane, the following factors of DTW must be considered [11].


1. End Point Constraints:

In an isolated word recognition system, the test pattern and the reference pattern have well defined end points that mark the beginning and the ending frames of the pattern. The end point constraints are of the form:

$$i(1) = 1, \quad j(1) = 1 \qquad \text{(beginning point)} \tag{A.7}$$

$$i(K) = N, \quad j(K) = M \qquad \text{(ending point)} \tag{A.8}$$


2. Local Continuity Constraint

To ensure proper time alignment and to avoid excessive loss of information, a set of local continuity constraints is applied. The first constraint of this type is the monotonicity constraint, namely

$$i(k+1) \ge i(k) \tag{A.9}$$

$$j(k+1) \ge j(k) \tag{A.10}$$

The path $P_r$ is defined as a sequence of moves, each specified by a pair of coordinate increments,

$$P_r \rightarrow (a_1^{(r)}, b_1^{(r)}), (a_2^{(r)}, b_2^{(r)}), \ldots, (a_{L(r)}^{(r)}, b_{L(r)}^{(r)}) \tag{A.11}$$

where r signifies the rth path and L(r) denotes the length of the rth path. The path is traced in the backward direction through L(r) points as

$$k\text{th point}: \quad i(k) = n, \quad j(k) = m \tag{A.12}$$

$$(k-s)\text{th point}: \quad i(k-s) = i(k) - \sum_{l=1}^{s} a_l^{(r)} \tag{A.13}$$

$$j(k-s) = j(k) - \sum_{l=1}^{s} b_l^{(r)} \tag{A.14}$$

for s = 1, 2, ..., L(r).


3. Global Path constraints:

These constraints restrict the path to lie in some region of the (n, m) plane. The boundary of the region in which the optimal path lies is given by the following inequalities

$$1 + \frac{i(k) - 1}{E_{max}} \le j(k) \le 1 + E_{max}\,(i(k) - 1) \tag{A.15}$$

$$M + E_{max}\,(i(k) - N) \le j(k) \le M + \frac{i(k) - N}{E_{max}} \tag{A.16}$$

where

$$E_{max} = \max_{r} \left[ \frac{\sum_{l=1}^{L(r)} b_l^{(r)}}{\sum_{l=1}^{L(r)} a_l^{(r)}} \right] \tag{A.17}$$

$$E_{min} = \min_{r} \left[ \frac{\sum_{l=1}^{L(r)} b_l^{(r)}}{\sum_{l=1}^{L(r)} a_l^{(r)}} \right] \tag{A.18}$$


4. Axis Orientation

Previously we have defined a common time axis k and arbitrarily assigned n to i(k) and m to j(k). We can also assign

$$n = j(k), \qquad k = 1, 2, \ldots, K \tag{A.19}$$

$$m = i(k), \qquad k = 1, 2, \ldots, K. \tag{A.20}$$

There is no difference between the above assignment and the previous assignment when both the local constraints and the distance metric are symmetric. However, when there is asymmetry in either the local constraints or the distance metric, the difference in variable assignment can be significant.


5. Distance Measure 


The last factor in DTW is the distance measure between the reference pattern
and the test pattern. The distance function is defined as

$$ D(i(k), j(k)) = \frac{\sum_{k=1}^{K} d(i(k), j(k))\, W(k)}{N(W)} \tag{A.21} $$

where D(i(k), j(k)) is the total distance along a path of length K,
d(i(k), j(k)) is the local distance between frame i(k) of the reference
and frame j(k) of the test, W(k) is a weighting function for the kth arc of
the path, and N(W) is a normalization factor that depends on the weighting
function W.

The optimal path is the one along which this distance is minimum. Let this
minimum distance be denoted by $\hat{D}$; then

$$ \hat{D} = \min_{K,\, i(k),\, j(k)} \bigl[ D(i(k), j(k)) \bigr] \tag{A.22} $$

To compute $\hat{D}$, the local distance function d, the weighting function
W, and the normalization factor N(W) must be specified.
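To make the roles of d, W and N(W) concrete, the following is a minimal dynamic programming sketch of the minimum normalized distance. The specific choices are assumptions made for illustration, not the settings of the original system: a Euclidean local distance, the three moves (1,0), (0,1) and (1,1) with weights 1, 1 and 2, a weight of 2 on the first point, and N(W) = N + M, which makes the normalization independent of the path. No global path constraint is applied.

```python
import math


def dtw_distance(test, ref):
    """Minimum normalized DTW distance between two sequences of vectors.

    Assumed for this sketch: Euclidean local distance, moves (1,0), (0,1),
    (1,1) weighted 1, 1, 2, and normalization N(W) = N + M.
    """
    n_frames, m_frames = len(test), len(ref)
    inf = math.inf
    # acc[i][j] is the smallest weighted distance of any path ending at (i, j).
    acc = [[inf] * m_frames for _ in range(n_frames)]
    for i in range(n_frames):
        for j in range(m_frames):
            d = math.dist(test[i], ref[j])  # local distance d(i, j)
            if i == 0 and j == 0:
                acc[i][j] = 2.0 * d  # beginning point, weight 2
                continue
            best = inf
            if i > 0:
                best = min(best, acc[i - 1][j] + d)
            if j > 0:
                best = min(best, acc[i][j - 1] + d)
            if i > 0 and j > 0:
                best = min(best, acc[i - 1][j - 1] + 2.0 * d)
            acc[i][j] = best
    # The path must end at the last frame of both patterns.
    return acc[-1][-1] / (n_frames + m_frames)
```

With these weights the sum of W(k) along any admissible path equals N + M, so the normalization can be applied once at the end rather than per path.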

The complete isolated word recognition system based on the DTW algorithm is
shown in Figure A.1.

Figure A.1: Block Diagram of LPC based word recognizer using a standard
DTW algorithm.

The front end processor provides a set of (p + 1) autocorrelation coefficients
for each frame. The output of the DTW algorithm for the vth reference word
is the minimum distance $\hat{D}^{(v)}$, v = 1, 2, ..., R, and the decision
rule processes the set of $\hat{D}^{(v)}$. It selects the reference pattern
that gives the smallest value of $\hat{D}^{(v)}$.

Appendix B


Linear Predictive Coding Of 
Speech 


Linear predictive analysis [13] is one of the most powerful speech signal
analysis techniques. It provides extremely accurate estimates of speech
parameters (e.g. pitch, formants, spectra, vocal tract area function). The
basic idea behind LPC modeling is that the speech sample at time n, s(n), can
be approximated as a linear combination of the past p samples, i.e.,

$$ s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p) \tag{B.1} $$

where the coefficients $a_1, a_2, \ldots, a_p$ are assumed to be constant over
the speech analysis frame. Based on the model shown in Figure B.1, the relation
between s(n) and u(n) (the normalized excitation source) is given as

$$ s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n) \tag{B.2} $$

where G is the gain of the excitation.

Figure B.1: Linear Predictive Model for Speech.

The estimated s(n) is defined as:

$$ \tilde{s}(n) = \sum_{k=1}^{p} a_k s(n-k). \tag{B.3} $$

The prediction error e(n) is calculated as

$$ e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k). \tag{B.4} $$
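The prediction error is easy to compute once a set of coefficients is available. The sketch below is illustrative only: it assumes samples before the start of the signal are zero, and the function name `prediction_error` is hypothetical.

```python
def prediction_error(signal, coeffs):
    """Return e(n) = s(n) - sum_k a_k s(n - k) for every sample n.

    `coeffs` holds a_1 .. a_p. Samples before the start of `signal` are
    treated as zero (an assumption of this sketch).
    """
    order = len(coeffs)
    errors = []
    for n, sample in enumerate(signal):
        # Linear prediction from the p previous samples that exist.
        predicted = sum(
            coeffs[k - 1] * signal[n - k]
            for k in range(1, order + 1)
            if n - k >= 0
        )
        errors.append(sample - predicted)
    return errors
```

For a signal that follows the model exactly, for example s(n) = 0.5 s(n-1) started by a single unit sample, the error is non zero only at the first sample, where the excitation enters.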

Now we have to determine a set of predictor coefficients $\{a_k\}$ directly
from the speech signal in such a manner as to obtain a good estimate of its
spectral properties. The basic approach is to find the set of predictor
coefficients that minimizes the mean square prediction error over a short
segment of the speech waveform.

Let the short time average prediction error be defined as

$$ \begin{aligned}
E_n &= \sum_{m} e_n^2(m) \\
    &= \sum_{m} \bigl[ s_n(m) - \tilde{s}_n(m) \bigr]^2 \\
    &= \sum_{m} \Bigl[ s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k) \Bigr]^2
\end{aligned} \tag{B.5} $$

where $s_n(m)$ is a segment of speech that has been selected in the vicinity
of sample n, i.e.

$$ s_n(m) = s(m + n) \tag{B.7} $$


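We can find the values of $a_k$ that minimize $E_n$ in equation B.5 by setting

$$ \frac{\partial E_n}{\partial a_i} = 0, \qquad i = 1, 2, \ldots, p \tag{B.8} $$

thereby obtaining

$$ \phi_n(i, 0) = \sum_{k=1}^{p} a_k\, \phi_n(i, k), \qquad i = 1, 2, \ldots, p \tag{B.9} $$

where $\phi_n(i, k)$ is defined as

$$ \phi_n(i, k) = \sum_{m} s_n(m-i)\, s_n(m-k) \tag{B.10} $$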
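The intermediate step between setting the derivative to zero and the normal equations can be written out as follows. This is the standard least squares argument, sketched here for completeness using the same symbols as above.

```latex
\begin{aligned}
\frac{\partial E_n}{\partial a_i}
  &= -2 \sum_{m} \Bigl[ s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k) \Bigr] s_n(m-i) = 0 \\
\Rightarrow \quad
\sum_{m} s_n(m-i)\, s_n(m)
  &= \sum_{k=1}^{p} a_k \sum_{m} s_n(m-i)\, s_n(m-k),
  \qquad i = 1, 2, \ldots, p .
\end{aligned}
```

The left side is $\phi_n(i, 0)$ and the inner sum on the right side is $\phi_n(i, k)$, which gives the set of p linear equations in the p unknown coefficients.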

To solve equation B.9 for the optimum predictor coefficients, we have to
compute $\phi_n(i, k)$ for $1 \leq i \leq p$ and $0 \leq k \leq p$ and then
solve the resulting set of p simultaneous equations. There are two standard
methods for doing so.


1. The autocorrelation method

In this approach $s_n(m)$ is assumed to be zero outside the interval
$0 \leq m \leq N-1$. Mathematically

$$ s_n(m) = s(m + n)\, w(m) \tag{B.11} $$

where w(m) is a finite length window (e.g. a Hamming window) that is
identically zero outside the interval $0 \leq m \leq N-1$. Since $s_n(m)$ is
non zero only for $0 \leq m \leq N-1$, the corresponding prediction error
$e_n(m)$ for the pth order predictor is non zero only over the interval
$0 \leq m \leq N-1+p$. Thus for this case $E_n$ is expressed as

$$ E_n = \sum_{m=0}^{N+p-1} e_n^2(m) \tag{B.12} $$

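and equation B.10 becomes

$$ \phi_n(i,k) = \sum_{m=0}^{N+p-1} s_n(m-i)\, s_n(m-k), \qquad 1 \leq i \leq p,\ 0 \leq k \leq p \tag{B.13} $$

which can be reduced to

$$ \phi_n(i,k) = \sum_{m=0}^{N-1-(i-k)} s_n(m)\, s_n(m+i-k), \qquad 1 \leq i \leq p,\ 0 \leq k \leq p \tag{B.14} $$

In this case $\phi_n(i,k)$ is identical to the short time autocorrelation
function evaluated at (i - k). That is

$$ \phi_n(i,k) = R_n(i-k) \tag{B.15} $$

where

$$ R_n(k) = \sum_{m=0}^{N-1-k} s_n(m)\, s_n(m+k) \tag{B.16} $$

Since $R_n(k)$ is an even function of k, equation B.9 can therefore be
written as

$$ \sum_{k=1}^{p} a_k\, R_n(|i-k|) = R_n(i), \qquad 1 \leq i \leq p \tag{B.17} $$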

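It can be represented in matrix form as

$$ \begin{bmatrix}
R_n(0) & R_n(1) & R_n(2) & \cdots & R_n(p-1) \\
R_n(1) & R_n(0) & R_n(1) & \cdots & R_n(p-2) \\
R_n(2) & R_n(1) & R_n(0) & \cdots & R_n(p-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
R_n(p-1) & R_n(p-2) & R_n(p-3) & \cdots & R_n(0)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} R_n(1) \\ R_n(2) \\ R_n(3) \\ \vdots \\ R_n(p) \end{bmatrix}
\tag{B.18} $$

The p x p matrix of autocorrelation values is a Toeplitz matrix, i.e. it is
symmetric and all elements along a given diagonal are equal.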
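The autocorrelation method can be sketched end to end in a few lines. The code below is an illustration under stated assumptions, not the implementation used in this work: it applies a Hamming window, computes the short time autocorrelation, builds the Toeplitz matrix, and solves the system with a generic Gaussian elimination routine that ignores the Toeplitz structure. All function names are hypothetical.

```python
import math


def hamming(n_samples):
    """Hamming window w(m) for m = 0 .. N-1."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (n_samples - 1))
            for m in range(n_samples)]


def autocorrelation(frame, max_lag):
    """Short time autocorrelation R(k) for k = 0 .. max_lag."""
    n_samples = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n_samples - k))
            for k in range(max_lag + 1)]


def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting (generic, not Toeplitz-aware)."""
    size = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[row][c] -= factor * aug[col][c]
    solution = [0.0] * size
    for row in reversed(range(size)):
        tail = sum(aug[row][c] * solution[c] for c in range(row + 1, size))
        solution[row] = (aug[row][size] - tail) / aug[row][row]
    return solution


def lpc_autocorrelation(samples, order):
    """Predictor coefficients a_1 .. a_p for one frame of samples."""
    frame = [s * w for s, w in zip(samples, hamming(len(samples)))]
    r = autocorrelation(frame, order)
    # Toeplitz matrix with entries R(|i - k|).
    toeplitz = [[r[abs(i - k)] for k in range(order)] for i in range(order)]
    return solve_linear_system(toeplitz, r[1:order + 1])
```

A generic solver is used here only to keep the sketch short; a solver that exploits the Toeplitz structure is cheaper for larger p.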


2. The covariance method 


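The second basic approach to defining the speech segment $s_n(m)$ and the
limits of the sum is to fix the interval over which the mean square error is
computed. If we define

$$ E_n = \sum_{m=0}^{N-1} e_n^2(m) \tag{B.19} $$

then $\phi_n(i, k)$ becomes

$$ \phi_n(i,k) = \sum_{m=0}^{N-1} s_n(m-i)\, s_n(m-k), \qquad 1 \leq i \leq p,\ 0 \leq k \leq p \tag{B.20} $$

or, by a change of variable,

$$ \phi_n(i,k) = \sum_{m=-i}^{N-i-1} s_n(m)\, s_n(m+i-k), \qquad 1 \leq i \leq p,\ 0 \leq k \leq p \tag{B.21} $$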

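Now equation B.9 becomes

$$ \phi_n(i, 0) = \sum_{k=1}^{p} a_k\, \phi_n(i, k), \qquad i = 1, 2, \ldots, p \tag{B.22} $$

In matrix form it can be represented as

$$ \begin{bmatrix}
\phi_n(1,1) & \phi_n(1,2) & \phi_n(1,3) & \cdots & \phi_n(1,p) \\
\phi_n(2,1) & \phi_n(2,2) & \phi_n(2,3) & \cdots & \phi_n(2,p) \\
\phi_n(3,1) & \phi_n(3,2) & \phi_n(3,3) & \cdots & \phi_n(3,p) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\phi_n(p,1) & \phi_n(p,2) & \phi_n(p,3) & \cdots & \phi_n(p,p)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} \phi_n(1,0) \\ \phi_n(2,0) \\ \phi_n(3,0) \\ \vdots \\ \phi_n(p,0) \end{bmatrix}
\tag{B.23} $$

The matrix is symmetric but, unlike in the autocorrelation case, not Toeplitz.
The resulting system of equations can be solved by Cholesky decomposition.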
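A minimal sketch of the covariance method with a Cholesky solve is given below. It assumes the covariance matrix is positive definite, that at least p samples are available before the analysis interval, and that the interval starts at index `start` of the sample list; the function names are hypothetical and no window is applied.

```python
import math


def covariance_matrix(samples, start, length, order):
    """phi(i, k) = sum over m = 0 .. N-1 of s(start+m-i) * s(start+m-k)."""
    def phi(i, k):
        return sum(samples[start + m - i] * samples[start + m - k]
                   for m in range(length))
    return [[phi(i, k) for k in range(order + 1)] for i in range(order + 1)]


def cholesky_solve(matrix, rhs):
    """Solve matrix * x = rhs for a symmetric positive definite matrix."""
    size = len(rhs)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    # Forward substitution: lower * y = rhs.
    y = [0.0] * size
    for i in range(size):
        y[i] = (rhs[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i]
    # Back substitution: transpose(lower) * x = y.
    x = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(lower[k][i] * x[k] for k in range(i + 1, size))
        x[i] = (y[i] - tail) / lower[i][i]
    return x


def lpc_covariance(samples, start, length, order):
    """Predictor coefficients a_1 .. a_p over samples[start : start + length]."""
    if start < order or start + length > len(samples):
        raise ValueError("need `order` samples before the analysis interval")
    phi = covariance_matrix(samples, start, length, order)
    matrix = [row[1:] for row in phi[1:]]           # phi(i, k), 1 <= i, k <= p
    rhs = [phi[i][0] for i in range(1, order + 1)]  # phi(i, 0)
    return cholesky_solve(matrix, rhs)
```

Because no window is applied and samples before the interval are used directly, a signal that obeys a pth order recursion exactly inside the interval is recovered without error.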


Here we are interested in the solution of the autocorrelation equations.


3. Durbin’s recursive solution for the autocorrelation equations

For the autocorrelation method the matrix equation B.18 can be solved by
several efficient recursive procedures. Durbin’s recursive procedure is given as

$$ E^{(0)} = R(0) $$

$$ k_i = \frac{R(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R(i-j)}{E^{(i-1)}}, \qquad 1 \leq i \leq p $$

$$ a_i^{(i)} = k_i $$

$$ a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \qquad 1 \leq j \leq i-1 $$

$$ E^{(i)} = (1 - k_i^2)\, E^{(i-1)} $$

The above equations are solved recursively for i = 1, 2, ..., p and the final
solution is given as

$$ a_j = a_j^{(p)}, \qquad 1 \leq j \leq p. \tag{B.24} $$

Because it exploits the Toeplitz structure of the matrix, Durbin’s recursion
needs on the order of p^2 operations, which makes it one of the most efficient
procedures for solving this system of equations.
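The recursion translates almost line for line into code. The following is a minimal sketch, assuming the autocorrelation values R(0) .. R(p) are already available and that the prediction error stays positive at every stage; the function name `durbin` and its return layout are choices made for this illustration.

```python
def durbin(r, order):
    """Solve the autocorrelation equations by Durbin's recursion.

    `r` holds R(0) .. R(order). Returns the predictor coefficients
    a_1 .. a_p, the intermediate values k_1 .. k_p, and the final error E.
    """
    if r[0] <= 0.0:
        raise ValueError("R(0) must be positive")
    a = [0.0] * (order + 1)  # a[j] holds the current a_j; a[0] is unused
    ks = []
    error = r[0]             # E at stage 0
    for i in range(1, order + 1):
        # k_i from the coefficients of the previous stage.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        ks.append(k)
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        # Error update; a zero error here would break the next division.
        error *= (1.0 - k * k)
    return a[1:], ks, error
```

For example, with R(k) = 0.5^k and p = 2 the recursion returns a_1 = 0.5 and a_2 = 0, as expected for a signal that is fully described by a first order predictor.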